In the electronic marketplace and online retail, recommender systems are widely used as decision aids, and it is well known that online recommendations strongly influence many consumers' decisions. Recent studies indicate that online suggestions can shape consumers' preference ratings as well as their readiness to buy certain merchandise.
Recommendation algorithms, which gather and generalize user preference patterns from recorded consumer-product interactions such as purchases and ratings, often fall under the category of collaborative filtering.
These feedback loops can present consumers with unfair (or irrelevant) recommendations, or leave items underrepresented in the input data, because of the various biases that may be at play.
A common hypothesis (known as ‘self-congruence’) is that a consumer tends to buy a product when its public impression (in our case, a product image), among other alternatives, is consistent with their self-perception (user identity). Under this assumption, the selection of human models for a product could influence a consumer’s behavior. Studies indicate that, in general, there are more interactions than expected on the consumer-product segments where users’ identities match the product images (‘self-congruity’), while several market segments, for example (‘Large’ user, ‘Small’ product), are underrepresented in the data. (https://dl.acm.org/doi/pdf/10.1145/3336191.3371855)
Under this premise, we will use the ModCloth dataset to test for bias with both the FairDetect and Aequitas frameworks. We will look for bias in the creation of a machine learning model that predicts whether a marketing strategy could affect consumer behavior, resulting in a biased interaction dataset of the kind commonly used as input for modern recommender systems.
ModCloth is an e-commerce website that sells women’s clothing and accessories. Many products on ModCloth include two human models with different body shapes, along with the measurements of these models. Users can optionally provide the product sizes they purchased and fit feedback (‘Just Right’, ‘Slightly Larger’, ‘Larger’, ‘Slightly Smaller’ or ‘Smaller’) along with their reviews.
Our source of bias is therefore the dimension of human body shape. There are two variables of interest:
User identity: the perception of oneself. We compute the average size each user purchased and classify users into ‘Small’ and ‘Large’ groups based on the same standard as the product body-shape image.
Product image: the public impression of a product. Attributes of the human models shown in the product pictures are used to generate this dataset. Products with only one human model wearing a relatively small size (‘XS’, ‘S’, ‘M’ or ‘L’) are labeled as the ‘Small’ group, while products with two models (an additional model wearing a plus size: ‘1X’, ‘2X’, ‘3X’ or ‘4X’) are referred to as the ‘Small&Large’ group.
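The grouping rule above can be sketched as a small helper. This is an illustrative sketch only: the function name and input format are our own assumptions, not part of the released dataset pipeline.

```python
# Size groups used to label the public impression of a product.
SMALL_SIZES = {'XS', 'S', 'M', 'L'}
PLUS_SIZES = {'1X', '2X', '3X', '4X'}

def label_product_image(model_sizes):
    """Label a product 'Small&Large' if any of its human models wears a
    plus size, and 'Small' otherwise (hypothetical helper)."""
    if any(size in PLUS_SIZES for size in model_sizes):
        return 'Small&Large'
    return 'Small'
```

For example, a product shown only on a model wearing ‘M’ falls into the ‘Small’ group, while one shown on models wearing ‘M’ and ‘2X’ falls into ‘Small&Large’.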
Using the FairDetect and Aequitas frameworks, we want to test whether there is an association between product image and user identity in consumers’ product selections.
Bringing the various theoretical concepts together into a practical framework, we can follow the “theoretical lens of a ‘sense-plan-act’ cycle”, as described by the HLEG framework (European Commission and Directorate-General for Communications Networks, Content and Technology, 2019). Applying this concept to the problem of ML fairness, we can break down three core steps in providing robust and responsible artificial intelligence: Identify, Understand, and Act (IUA).
By understanding the philosophical forms of unfairness as defined by our literature review, and by categorizing our prominent fairness metrics into the overarching categories of representation, ability, and performance, we can establish a series of tests to “identify” disparities between sensitive groups. Merging these findings with the explainability of our models, through the use of white-box models or Shapley value estimation for black-box models, we can dig deeper into the model’s predictions, “understanding” how classifications were made and how they varied from the natural dataset, exposing both natural biases and added model differences. Finally, by probing further into levels of misclassification, in particular negative outcomes, we can isolate the groups most at risk and set up a series of “actions” that can be taken to mitigate the effects. Given this three-step framework, which combines societal, legal, and technical considerations, the paper will go through a series of cases and examine the proposed framework.
#pip install fairdetect-groupb==0.46
### Loading the AMAZING Fairdetect Class
import fairdetect_groupb
time taken in execution is : 15.418489456176758
fd = fairdetect_groupb.Fairdetect(None,None,None)
### Load other relevant libraries
from __future__ import print_function  # future imports must precede all other imports

import sqlite3
from random import randrange

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import dalex as dx
from scipy.stats import chi2_contingency, chisquare
from tabulate import tabulate
from sklearn.metrics import confusion_matrix, precision_score
# Loading our DataFrame
data = pd.read_csv('modcloth.csv')
data
| | item_id | user_id | rating | timestamp | size | fit | user_attr | model_attr | category | brand | year | split |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7443 | Alex | 4 | 2010-01-21 08:00:00+00:00 | NaN | NaN | Small | Small | Dresses | NaN | 2012 | 0 |
| 1 | 7443 | carolyn.agan | 3 | 2010-01-27 08:00:00+00:00 | NaN | NaN | NaN | Small | Dresses | NaN | 2012 | 0 |
| 2 | 7443 | Robyn | 4 | 2010-01-29 08:00:00+00:00 | NaN | NaN | Small | Small | Dresses | NaN | 2012 | 0 |
| 3 | 7443 | De | 4 | 2010-02-13 08:00:00+00:00 | NaN | NaN | NaN | Small | Dresses | NaN | 2012 | 0 |
| 4 | 7443 | tasha | 4 | 2010-02-18 08:00:00+00:00 | NaN | NaN | Small | Small | Dresses | NaN | 2012 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 99888 | 154797 | BernMarie | 5 | 2019-06-26 21:15:13.165000+00:00 | 6.0 | Just right | Large | Small&Large | Dresses | NaN | 2017 | 0 |
| 99889 | 77949 | Sam | 4 | 2019-06-26 23:22:29.633000+00:00 | 4.0 | Slightly small | Small | Small&Large | Bottoms | NaN | 2014 | 2 |
| 99890 | 67194 | Janice | 5 | 2019-06-27 00:20:52.125000+00:00 | NaN | Just right | Small | Small&Large | Dresses | NaN | 2013 | 2 |
| 99891 | 71607 | amy | 3 | 2019-06-27 15:45:06.250000+00:00 | NaN | Slightly small | Small | Small&Large | Outerwear | Jack by BB Dakota | 2016 | 2 |
| 99892 | 119732 | sarah | 3 | 2019-06-29 13:55:16.542000+00:00 | NaN | Just right | Small | Small | Dresses | NaN | 2016 | 2 |
99893 rows × 12 columns
data.drop(['split','timestamp','year','item_id','user_id'], axis=1,inplace= True)
# Filling in missing values with mode.
for col in ['rating','size','fit', 'category', 'brand', 'model_attr' ]:
data[col].fillna(data[col].mode()[0], inplace=True)
data['user_attr'].fillna('Small', inplace=True)
data_ae = data.copy()
data_ae['rating'] = data_ae['rating'].map({1:'bad', 2:'poor', 3:'average', 4:'great', 5:'excelent'})
data_bd = data.copy()
data_bd
| | rating | size | fit | user_attr | model_attr | category | brand |
|---|---|---|---|---|---|---|---|
| 0 | 4 | 2.0 | Just right | Small | Small | Dresses | ModCloth |
| 1 | 3 | 2.0 | Just right | Small | Small | Dresses | ModCloth |
| 2 | 4 | 2.0 | Just right | Small | Small | Dresses | ModCloth |
| 3 | 4 | 2.0 | Just right | Small | Small | Dresses | ModCloth |
| 4 | 4 | 2.0 | Just right | Small | Small | Dresses | ModCloth |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 99888 | 5 | 6.0 | Just right | Large | Small&Large | Dresses | ModCloth |
| 99889 | 4 | 4.0 | Slightly small | Small | Small&Large | Bottoms | ModCloth |
| 99890 | 5 | 2.0 | Just right | Small | Small&Large | Dresses | ModCloth |
| 99891 | 3 | 2.0 | Slightly small | Small | Small&Large | Outerwear | Jack by BB Dakota |
| 99892 | 3 | 2.0 | Just right | Small | Small | Dresses | ModCloth |
99893 rows × 7 columns
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
data_bd['model_attr'] = le.fit_transform(data_bd['model_attr'])
data_bd['category'] = le.fit_transform(data_bd['category'])
data_bd['brand'] = le.fit_transform(data_bd['brand'])
data_bd['user_attr'] = le.fit_transform(data_bd['user_attr'])
data_bd['fit'] = le.fit_transform(data_bd['fit'])
data_bd
| | rating | size | fit | user_attr | model_attr | category | brand |
|---|---|---|---|---|---|---|---|
| 0 | 4 | 2.0 | 0 | 1 | 0 | 1 | 19 |
| 1 | 3 | 2.0 | 0 | 1 | 0 | 1 | 19 |
| 2 | 4 | 2.0 | 0 | 1 | 0 | 1 | 19 |
| 3 | 4 | 2.0 | 0 | 1 | 0 | 1 | 19 |
| 4 | 4 | 2.0 | 0 | 1 | 0 | 1 | 19 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 99888 | 5 | 6.0 | 0 | 0 | 1 | 1 | 19 |
| 99889 | 4 | 4.0 | 2 | 1 | 1 | 0 | 19 |
| 99890 | 5 | 2.0 | 0 | 1 | 1 | 1 | 19 |
| 99891 | 3 | 2.0 | 2 | 1 | 1 | 2 | 13 |
| 99892 | 3 | 2.0 | 0 | 1 | 0 | 1 | 19 |
99893 rows × 7 columns
#!pip install pandas-profiling
# uncomment this cell to install it
from pandas_profiling import ProfileReport
report = ProfileReport(data_bd, minimal=False)
report
help(fd.check_for_target)
Help on method check_for_target in module fairdetect_groupb:
check_for_target(data_bd) method of fairdetect_groupb.Fairdetect instance
Allows user to define the target variable out of the columns available
Checks if target variable is Binary or not. Only binary target variable accepted
Return
------
Selection of the target variable from all columns of the dataframe
target = fd.check_for_target(data_bd)
# INSERT: model_attr
['rating' 'size' 'fit' 'user_attr' 'model_attr' 'category' 'brand']
Please select target column: model_attr
You choose wisely. Your target variable is: model_attr
from sklearn.model_selection import train_test_split
X = data_bd.drop(['model_attr'],axis=1) # axis: {0 or ‘index’, 1 or ‘columns’}, default 0
y = data_bd['model_attr']
X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.8, test_size=0.2, random_state=0)
print("Data successfully loaded!")
Data successfully loaded!
import xgboost
model = xgboost.XGBClassifier().fit(X_train, y_train)
y_test_predict = model.predict(X_test)
y_test_predict
array([1, 0, 1, ..., 1, 1, 1])
bd=fairdetect_groupb.Fairdetect(model,X_test,y_test)
help(fd.get_sensitive_col)
Help on method get_sensitive_col in module fairdetect_groupb:
get_sensitive_col(X_test) method of fairdetect_groupb.Fairdetect instance
Allows user to define the sensitive variable out of the columns available
Checks if sensitive variable is Binary or not. Only binary sensitive variable accepted
Return
------
Selection for the sensitive variable from all columns of the dataframe
sensitive=bd.get_sensitive_col(X_test)
labels=bd.create_labels(sensitive,X_test)
bd.identify_bias(sensitive,labels)
#INSERT the column name of the Sensitive Variable: user_attr
#INSERT label 0: Large
#INSERT in label 1: Small
#Target Variable
#INSERT label 0: Small&Large
#INSERT in label 1: Small
Please select the sensitive column name from given features
['rating' 'size' 'fit' 'user_attr' 'category' 'brand']
Enter sensitive column here: user_attr
Please Enter Label for Group 0: Large
Please Enter Label for Group 1: Small
Enter Target names below
Please Enter name for target predicted 0: Small&Large
Please Enter name for target predicted 1: Small
REPRESENTATION
╒═════════════╤═════════╤══════════╕
│             │   Large │    Small │
╞═════════════╪═════════╪══════════╡
│ Small&Large │ 0.38072 │ 0.435343 │
├─────────────┼─────────┼──────────┤
│ Small       │ 0.61928 │ 0.564657 │
╘═════════════╧═════════╧══════════╛
*** Reject H0: Significant Relation Between user_attr and Target with p= 1.506383135693924e-09
ABILITY
Accept H0: True Positive Disparity Not Detected. p= 0.48434499286000354
Accept H0: False Positive Disparity Not Detected. p= 0.4041440496472899
Accept H0: True Negative Disparity Not Detected. p= 0.23594324170925463
* Reject H0: Significant False Negative Disparity with p= 0.08288170782370083
PREDICTIVE
Accept H0: No Significant Predictive Disparity. p= 0.7076922048343487
# Store the most recent p-values of the TPR, FPR, TNR and FNR disparity tests in variables
p_TPR = bd.TPR_p
p_FPR = bd.FPR_p
p_TNR = bd.TNR_p
p_FNR = bd.FNR_p
The analysis is split into three parts in order to identify areas of bias:

REPRESENTATION: Comparison of the sensitive attribute, the user identity group (Large=0, Small=1), against the target variable (model attribute, i.e. product image: Small=0, Small&Large=1).

Demographic parity: the association between the target variable and the sensitive variable.

P-value: we reject the null hypothesis of no significant relation between the sensitive variable and the target variable. This means THERE IS a relationship between user identity and product image.

ABILITY: Analysis of specific sensitive groups. Regardless of the sensitive background, a fair scenario would show a 50/50 ratio.

For the false negative rate (FNR) there is a significant difference between 'Large' and 'Small', large enough to reject the null hypothesis. False negative disparity is therefore present: users identified with large sizes face a higher chance of marketing misrepresentation than users identified with small sizes.

PREDICTIVE: The model is not further exacerbating any observed bias; whatever is present in the dataset is also observed in the predictions.
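The representation test above amounts to a chi-squared test of independence between the sensitive attribute and the predicted target. A minimal sketch with made-up counts (the numbers below are illustrative, not the actual ModCloth figures):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical contingency table: predicted target (rows) vs. sensitive
# attribute (columns); counts are illustrative only.
contingency = pd.DataFrame(
    {'Large': [120, 195], 'Small': [400, 519]},
    index=['Small&Large', 'Small']
)
chi2, p, dof, expected = chi2_contingency(contingency)
if p < 0.05:
    print(f"Reject H0: significant relation with p= {p}")
else:
    print(f"Accept H0: no significant relation, p= {p}")
```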
from itertools import product
for i,j in product([0,1],repeat=2):
print("Visualization for affected_group = {0} and affected_target = {1}".format(i,j))
bd.understand_shap(labels,sensitive,i,j)
Output (for each combination of affected_group and affected_target): a deprecation warning (`ntree_limit` is deprecated, use `iteration_range` or model slicing instead) followed by the plots: Model Importance Comparison, Affected Attribute Comparison, Average Comparison to True Class Members, Average Comparison to All Members, and Random Affected Decision Process.
Affected Group = 0 = Large; Affected Target = 0 = Small.

Brand is the most significant variable for both user identity types, large and small; however, it is relatively more important for small-size users than for large-size users.

Fit is twice as relevant for users identified with large sizes compared to users identified with small sizes.

For the affected group and target, the most relevant variable is category, followed by size and fit, which are significantly more relevant for this group than for the entire population.
Disparate impact is a metric to evaluate fairness. It differentiates between an unprivileged/unfavoured group and a privileged/favoured group: the proportion of the unprivileged group that receives the positive outcome is divided by the proportion of the privileged group that receives the positive outcome.
The result is interpreted by the four-fifths rule: if the unprivileged group's positive-outcome rate is less than 80% of the privileged group's, there is a disparate impact violation.
In this analysis we additionally defined a ratio of 80-90% as a sign of mild impact violation. A ratio of 1 indicates perfect equality.
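The calculation described above can be sketched as follows. This is our own illustration with hypothetical counts, not the FairDetect implementation:

```python
def disparate_impact_ratio(unpriv_pos, unpriv_total, priv_pos, priv_total):
    """Proportion of the unprivileged group with the positive outcome,
    divided by the proportion of the privileged group with it."""
    return (unpriv_pos / unpriv_total) / (priv_pos / priv_total)

# Hypothetical counts: 60 of 100 unprivileged vs. 80 of 100 privileged
# individuals receive the positive outcome.
ratio = disparate_impact_ratio(60, 100, 80, 100)
if ratio < 0.8:
    verdict = "disparate impact violation (four-fifths rule)"
elif ratio < 0.9:
    verdict = "mild impact violation"
else:
    verdict = "no violation"
print(ratio, verdict)
```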
bd.disparate_impact(sensitive,labels)
Disparate Impact, Sensitive vs. Predicted Target: 0.9117959843400442 The disparate impact ratio indicated complete equality between the favoured and unfavoured group
0.9117959843400442
In case we want to work further with the KPIs of the FairDetect method at a later stage without calling all of the functions above, we can store the intermediate results in SQLite.
# First import the modules needed to run SQLAlchemy
import sqlalchemy as db
from sqlalchemy import create_engine
from sqlalchemy import inspect
from sqlalchemy import Column, Integer, DateTime
import datetime
class store_pvalues:
    def __init__(self, p_TPR, p_FPR, p_TNR, p_FNR):
        self.p_TPR, self.p_FPR, self.p_TNR, self.p_FNR = p_TPR, p_FPR, p_TNR, p_FNR

    def dynamic_data_entry(self, p_TPR, p_FPR, p_TNR, p_FNR):
        """
        Insert the p-values of TPR, FPR, TNR and FNR dynamically into a table

        Parameters
        ----------
        p_TPR: p-value of the True Positive Rate calculation
        p_FPR: p-value of the False Positive Rate calculation
        p_TNR: p-value of the True Negative Rate calculation
        p_FNR: p-value of the False Negative Rate calculation
        """
        query = db.insert(pvalues).values(Id=123, p_TPR=self.p_TPR, p_FPR=self.p_FPR, p_TNR=self.p_TNR, p_FNR=self.p_FNR)
        ResultProxy = connection.execute(query)
We first create the database credit_card_approval for all tables related to the dataset. Then we connect to the database and create the table pvalues, which stores the p-values of TPR, TNR, FPR and FNR.
engine = db.create_engine('sqlite:///credit_card_approval.db') #Create credit_card approval.sqlite automatically
connection = engine.connect() #connect database
metadata = db.MetaData()
pvalues = db.Table('pvalues', metadata,
db.Column('Id', db.Integer()),
db.Column('date', db.types.DateTime(timezone=True), default=datetime.datetime.utcnow),
db.Column('p_TPR', db.Float(), default=100.0),
db.Column('p_FPR', db.Float(), default=100.0),
db.Column('p_TNR', db.Float(), default=100.0),
db.Column('p_FNR', db.Float(), default=100.0)
)
metadata.create_all(engine) #Command to create the table
Through the inspector it is possible to check which tables exist in the database and which columns each table has.
inspector = inspect(engine)
connection = engine.connect()
#printing the name of the table to check if it was successfully created
print(inspector.get_table_names())
['pvalues']
#printing the columns
print(pvalues.columns.keys())
['Id', 'date', 'p_TPR', 'p_FPR', 'p_TNR', 'p_FNR']
Furthermore, we can look at the table's metadata and see what type each column has.
#showing the metadata of the sql alchemy table
print(repr(metadata.tables['pvalues']))
Table('pvalues', MetaData(), Column('Id', Integer(), table=<pvalues>), Column('date', DateTime(timezone=True), table=<pvalues>, default=ColumnDefault(<function datetime.utcnow at 0x00000222B13091F0>)), Column('p_TPR', Float(), table=<pvalues>, default=ColumnDefault(100.0)), Column('p_FPR', Float(), table=<pvalues>, default=ColumnDefault(100.0)), Column('p_TNR', Float(), table=<pvalues>, default=ColumnDefault(100.0)), Column('p_FNR', Float(), table=<pvalues>, default=ColumnDefault(100.0)), schema=None)
After the table has been created successfully, we follow the same methodology as in FairDetect: first create the data-entry object, then enter the dynamic data itself.
p_value_object = store_pvalues(p_TPR,p_FPR,p_TNR,p_FNR)
We then call the object's data-entry method, handing over the most recent p-values of TPR, FPR, TNR and FNR that resulted from the FairDetect analysis.
insert_data = p_value_object.dynamic_data_entry(p_TPR,p_FPR,p_TNR,p_FNR)
To check whether the data was inserted, we query the credit_card_approval database for the most recent entries and print the latest five results.
query = db.select([pvalues])
ResultProxy = connection.execute(query)
ResultSet = ResultProxy.fetchall()
ResultSet[:5]
[(123, datetime.datetime(2022, 7, 30, 7, 13, 10, 864000), 0.7159221552842456, 0.588169391004731, 0.9319949919140254, 0.1040665041132579), (123, datetime.datetime(2022, 7, 30, 8, 8, 39, 111772), 0.7159221552842456, 0.588169391004731, 0.9319949919140254, 0.1040665041132579), (123, datetime.datetime(2022, 7, 30, 8, 24, 17, 213982), 0.7159221552842456, 0.588169391004731, 0.9319949919140254, 0.1040665041132579), (123, datetime.datetime(2022, 7, 30, 8, 32, 49, 809242), 0.7159221552842456, 0.588169391004731, 0.9319949919140254, 0.1040665041132579), (123, datetime.datetime(2022, 7, 30, 9, 7, 29, 758131), 0.48434499286000354, 0.4041440496472899, 0.23594324170925463, 0.08288170782370083)]
We are now ready to utilize AEQUITAS to detect bias.
The Aequitas toolkit is a flexible bias-audit utility for algorithmic decision-making models, accessible via a Python API. It will help us evaluate the performance of the model across several bias and fairness metrics. Here are the steps involved:
1) Understand where biases exist in the ModCloth dataset and in the model
2) Compare the level of bias between groups in our sample population (bias disparity)
3) Assess model fairness and visualize absolute bias metrics and their related disparities for rapid comprehension and decision-making
As with any Python program, the first step is to import the necessary packages. Below we import several components from the Aequitas package, along with some other useful non-Aequitas packages.
#!pip install aequitas
import pandas as pd
import seaborn as sns
from aequitas.group import Group
from aequitas.bias import Bias
from aequitas.fairness import Fairness
from aequitas.plotting import Plot
import aequitas.plot as ap
import warnings; warnings.simplefilter('ignore')
%matplotlib inline
Now that we've identified the protected attribute 'user_attr' and defined privileged and unprivileged values, we can use Aequitas to detect bias in the dataset.
from sklearn.model_selection import train_test_split
df_ae = data_ae.copy()
X_ae = df_ae.drop(["model_attr"],axis=1) # axis: {0 or ‘index’, 1 or ‘columns’}, default 0
y_ae = data_bd['model_attr'] # use the label-encoded target (0/1) rather than the string labels in data_ae
X_train_ae, X_test_ae, y_train_ae, y_test_ae = train_test_split(X_ae,y_ae,train_size=0.8, test_size=0.2, random_state=0)
print("Data successfully loaded!")
Data successfully loaded!
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
pipe = make_pipeline(OneHotEncoder(handle_unknown="ignore"),
DecisionTreeClassifier(min_samples_leaf=10))
pipe.fit(X_train_ae, y_train_ae)
Pipeline(steps=[('onehotencoder', OneHotEncoder(handle_unknown='ignore')),
('decisiontreeclassifier',
DecisionTreeClassifier(min_samples_leaf=10))])
data_ae_full = X_test_ae.copy().reset_index(drop=True)
data_ae_full["score"] = pipe.predict_proba(X_test_ae)[:, 1]
data_ae_full["label_value"] = y_test_ae.copy().reset_index(drop=True)
data_ae_full.head(n=4)
| | rating | size | fit | user_attr | category | brand | score | label_value |
|---|---|---|---|---|---|---|---|---|
| 0 | average | 4.0 | Slightly small | Small | Dresses | ModCloth | 0.625000 | 0 |
| 1 | excelent | 2.0 | Just right | Small | Bottoms | ModCloth | 0.495407 | 0 |
| 2 | excelent | 7.0 | Just right | Large | Dresses | ModCloth | 0.691964 | 0 |
| 3 | excelent | 5.0 | Just right | Small | Bottoms | ModCloth | 0.391960 | 1 |
data_ae_small = data_ae_full[["fit", "user_attr", "category", "rating","score", "label_value"]].copy()
data_ae_small.head(n=5)
| | fit | user_attr | category | rating | score | label_value |
|---|---|---|---|---|---|---|
| 0 | Slightly small | Small | Dresses | average | 0.625000 | 0 |
| 1 | Just right | Small | Bottoms | excelent | 0.495407 | 0 |
| 2 | Just right | Large | Dresses | excelent | 0.691964 | 0 |
| 3 | Just right | Small | Bottoms | excelent | 0.391960 | 1 |
| 4 | Just right | Small | Dresses | excelent | 0.680506 | 0 |
What is the distribution of groups, predicted scores, and labels across the dataset?
Aequitas's Group() class enables researchers to evaluate bias across all subgroups in their dataset by assembling a confusion matrix for each subgroup and calculating commonly used metrics such as the false positive rate and the false omission rate, as well as counts by group and group prevalence in the sample population.
from aequitas.group import Group
g = Group()
xtab, _ = g.get_crosstabs(data_ae_small)
absolute_metrics = g.list_absolute_metrics(xtab)
# View calculated counts across sample population groups
xtab[[col for col in xtab.columns if col not in absolute_metrics]]
| | model_id | score_threshold | k | attribute_name | attribute_value | pp | pn | fp | fn | tn | tp | group_label_pos | group_label_neg | group_size | total_entities |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | binary 0/1 | 232 | fit | Just right | 111 | 14543 | 1 | 8321 | 6222 | 110 | 8431 | 6223 | 14654 | 19979 |
| 1 | 0 | binary 0/1 | 232 | fit | Slightly large | 19 | 2185 | 0 | 1288 | 897 | 19 | 1307 | 897 | 2204 | 19979 |
| 2 | 0 | binary 0/1 | 232 | fit | Slightly small | 70 | 2210 | 5 | 1231 | 979 | 65 | 1296 | 984 | 2280 | 19979 |
| 3 | 0 | binary 0/1 | 232 | fit | Very large | 10 | 430 | 1 | 259 | 171 | 9 | 268 | 172 | 440 | 19979 |
| 4 | 0 | binary 0/1 | 232 | fit | Very small | 22 | 379 | 0 | 159 | 220 | 22 | 181 | 220 | 401 | 19979 |
| 5 | 0 | binary 0/1 | 232 | user_attr | Large | 42 | 3651 | 2 | 2247 | 1404 | 40 | 2287 | 1406 | 3693 | 19979 |
| 6 | 0 | binary 0/1 | 232 | user_attr | Small | 190 | 16096 | 5 | 9011 | 7085 | 185 | 9196 | 7090 | 16286 | 19979 |
| 7 | 0 | binary 0/1 | 232 | category | Bottoms | 80 | 4633 | 0 | 2278 | 2355 | 80 | 2358 | 2355 | 4713 | 19979 |
| 8 | 0 | binary 0/1 | 232 | category | Dresses | 94 | 6706 | 6 | 3901 | 2805 | 88 | 3989 | 2811 | 6800 | 19979 |
| 9 | 0 | binary 0/1 | 232 | category | Outerwear | 46 | 1434 | 0 | 830 | 604 | 46 | 876 | 604 | 1480 | 19979 |
| 10 | 0 | binary 0/1 | 232 | category | Tops | 12 | 6974 | 1 | 4249 | 2725 | 11 | 4260 | 2726 | 6986 | 19979 |
| 11 | 0 | binary 0/1 | 232 | rating | average | 44 | 2235 | 3 | 1229 | 1006 | 41 | 1270 | 1009 | 2279 | 19979 |
| 12 | 0 | binary 0/1 | 232 | rating | bad | 15 | 742 | 0 | 380 | 362 | 15 | 395 | 362 | 757 | 19979 |
| 13 | 0 | binary 0/1 | 232 | rating | excelent | 106 | 10745 | 1 | 6283 | 4462 | 105 | 6388 | 4463 | 10851 | 19979 |
| 14 | 0 | binary 0/1 | 232 | rating | great | 42 | 4891 | 1 | 2755 | 2136 | 41 | 2796 | 2137 | 4933 | 19979 |
| 15 | 0 | binary 0/1 | 232 | rating | poor | 25 | 1134 | 2 | 611 | 523 | 23 | 634 | 525 | 1159 | 19979 |
#View calculated absolute metrics for each sample population group
xtab[['attribute_name', 'attribute_value'] + absolute_metrics].round(2)
| | attribute_name | attribute_value | tpr | tnr | for | fdr | fpr | fnr | npv | precision | ppr | pprev | prev |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | fit | Just right | 0.01 | 1.00 | 0.57 | 0.01 | 0.00 | 0.99 | 0.43 | 0.99 | 0.48 | 0.01 | 0.58 |
| 1 | fit | Slightly large | 0.01 | 1.00 | 0.59 | 0.00 | 0.00 | 0.99 | 0.41 | 1.00 | 0.08 | 0.01 | 0.59 |
| 2 | fit | Slightly small | 0.05 | 0.99 | 0.56 | 0.07 | 0.01 | 0.95 | 0.44 | 0.93 | 0.30 | 0.03 | 0.57 |
| 3 | fit | Very large | 0.03 | 0.99 | 0.60 | 0.10 | 0.01 | 0.97 | 0.40 | 0.90 | 0.04 | 0.02 | 0.61 |
| 4 | fit | Very small | 0.12 | 1.00 | 0.42 | 0.00 | 0.00 | 0.88 | 0.58 | 1.00 | 0.09 | 0.05 | 0.45 |
| 5 | user_attr | Large | 0.02 | 1.00 | 0.62 | 0.05 | 0.00 | 0.98 | 0.38 | 0.95 | 0.18 | 0.01 | 0.62 |
| 6 | user_attr | Small | 0.02 | 1.00 | 0.56 | 0.03 | 0.00 | 0.98 | 0.44 | 0.97 | 0.82 | 0.01 | 0.56 |
| 7 | category | Bottoms | 0.03 | 1.00 | 0.49 | 0.00 | 0.00 | 0.97 | 0.51 | 1.00 | 0.34 | 0.02 | 0.50 |
| 8 | category | Dresses | 0.02 | 1.00 | 0.58 | 0.06 | 0.00 | 0.98 | 0.42 | 0.94 | 0.41 | 0.01 | 0.59 |
| 9 | category | Outerwear | 0.05 | 1.00 | 0.58 | 0.00 | 0.00 | 0.95 | 0.42 | 1.00 | 0.20 | 0.03 | 0.59 |
| 10 | category | Tops | 0.00 | 1.00 | 0.61 | 0.08 | 0.00 | 1.00 | 0.39 | 0.92 | 0.05 | 0.00 | 0.61 |
| 11 | rating | average | 0.03 | 1.00 | 0.55 | 0.07 | 0.00 | 0.97 | 0.45 | 0.93 | 0.19 | 0.02 | 0.56 |
| 12 | rating | bad | 0.04 | 1.00 | 0.51 | 0.00 | 0.00 | 0.96 | 0.49 | 1.00 | 0.06 | 0.02 | 0.52 |
| 13 | rating | excelent | 0.02 | 1.00 | 0.58 | 0.01 | 0.00 | 0.98 | 0.42 | 0.99 | 0.46 | 0.01 | 0.59 |
| 14 | rating | great | 0.01 | 1.00 | 0.56 | 0.02 | 0.00 | 0.99 | 0.44 | 0.98 | 0.18 | 0.01 | 0.57 |
| 15 | rating | poor | 0.04 | 1.00 | 0.54 | 0.08 | 0.00 | 0.96 | 0.46 | 0.92 | 0.11 | 0.02 | 0.55 |
The false positive rate (FPR) is the fraction of individuals identified with a large size whom the model misclassifies with a small model image. FPR is quite low across all groups and labels.
The false negative rate (FNR) is the fraction of individuals identified with a small body size whom the model misclassifies with a small & large model image. One group raises concerns here: the 'Very small' fit group seems to be given a wrong model image very often compared to the rest of the fit attributes.
The false discovery rate (FDR) is the fraction of individuals whom the model predicts to have a small size but whose individual perception of their body is large. The 'Very large' fit group seems to be impacted in this category.
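These rates follow directly from each group's confusion-matrix counts. As a check, the sketch below recomputes them for the user_attr = 'Large' row of the crosstab table above (tp=40, fp=2, tn=1404, fn=2247):

```python
# Confusion-matrix counts for the user_attr = 'Large' group, taken from
# the Aequitas crosstab table above.
tp, fp, tn, fn = 40, 2, 1404, 2247

fpr = fp / (fp + tn)  # false positive rate: FP over actual negatives
fnr = fn / (fn + tp)  # false negative rate: FN over actual positives
fdr = fp / (fp + tp)  # false discovery rate: FP over predicted positives
print(f"FPR={fpr:.2f}, FNR={fnr:.2f}, FDR={fdr:.2f}")
```

Rounded to two decimals, these match the 0.00, 0.98 and 0.05 reported for the 'Large' group in the absolute-metrics table.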
The chart below displays the group metric predicted positive rate (ppr) calculated across each attribute, colored based on the number of samples in the attribute group.
We can see from the longer bars that, across the 'rating', 'user_attr' and 'fit' attributes, the groups ModCloth most often predicts as having a small & large image profile are those rated excellent, with a 'Just right' fit judgement, and identified as small size. From the darker coloring, we can also tell that these are the three largest populations in the dataset.
from aequitas.plotting import Plot
aqp = Plot()
ppr = aqp.plot_group_metric(xtab, 'ppr')
Extremely small group sizes increase the standard error of the estimates and can drive prediction errors such as false negatives, so we use the min_group_size parameter to visualize only those sample population groups above a user-specified percentage of the total sample size.
PPR1 = aqp.plot_group_metric(xtab, 'ppr', min_group_size=0.05)
We use the Aequitas Bias() class to calculate disparities between groups based on the crosstab returned by the Group() class get_crosstabs() method described above.
Disparities are calculated as a ratio of a metric for a group of interest compared to a base group.
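As a sketch of that ratio (our own illustration, not the Aequitas internals), here is the FNR disparity of the 'Large' user_attr group relative to the 'Small' reference group, using the counts from the crosstab above:

```python
def disparity(group_metric, ref_metric):
    """A group's metric value divided by the reference group's value;
    a ratio of 1.0 means parity."""
    return group_metric / ref_metric

fnr_large = 2247 / (2247 + 40)   # FNR of user_attr = 'Large'
fnr_small = 9011 / (9011 + 185)  # FNR of user_attr = 'Small' (reference group)
print(round(disparity(fnr_large, fnr_small), 3))
```

A value just above 1 indicates the 'Large' group's FNR is slightly higher than the reference group's.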
from aequitas.bias import Bias
b = Bias()
bdf = b.get_disparity_predefined_groups(xtab,
original_df=data_ae_small,
ref_groups_dict={'category':'Dresses','user_attr':'Small', 'fit':'Just right', 'rating':'excelent'},
alpha=0.05,
check_significance=False)
get_disparity_predefined_group()
The treemap below displays predicted positive rate (ppr) disparity values calculated against a predefined reference group, in this case the 'Small' group within the user_attr attribute, sized based on group size and colored based on disparity magnitude. The farther from 1, the more disparity exists among the groups.
ppr_disparity = aqp.plot_disparity(bdf, group_metric='ppr_disparity',
attribute_name='user_attr')
ppr_disparity = aqp.plot_disparity(bdf, group_metric='ppr_disparity',
attribute_name='fit')
ppr_disparity = aqp.plot_disparity(bdf, group_metric='ppr_disparity',
attribute_name='rating')
from aequitas.fairness import Fairness
f = Fairness()
fdf = f.get_group_value_fairness(bdf)
The chart below displays the absolute group metric predicted positive rate (ppr) across each attribute, colored based on the fairness determination for that attribute group (green = 'True', red = 'False').
We can see from the green color that only the 'excelent' rating, 'Small' user_attr, and 'Just right' fit groups have been determined to be fair. These are the groups selected as reference groups, so this model is not fair in terms of statistical parity for any of the other groups.
ppr_fairness = aqp.plot_fairness_group(fdf, group_metric='ppr', title=True)
The parity test graphs are another visualization aid that helps us identify where the bias lies, based on a defined disparity tolerance that can be adjusted according to the results a researcher wishes to evaluate. For the sake of this exercise we have selected a disparity tolerance of 1.25.
As per the results, we can observe, for each of the categories in the ModCloth data, the level of disparity compared to the reference group for each of the variables.
metrics = ['for', 'fnr', 'npv', 'precision', 'ppr']
disparity_tolerance = 1.25
ap.summary(bdf, metrics, fairness_threshold = disparity_tolerance)
ap.disparity(bdf, metrics, 'fit', fairness_threshold = disparity_tolerance)
ap.disparity(bdf, metrics, 'user_attr', fairness_threshold = disparity_tolerance)
ap.disparity(bdf, metrics, 'rating', fairness_threshold = disparity_tolerance)